A Search Engine in Perl

Max Maischein

Frankfurt.pm

Overview

Motivation
Structure of a search engine
Ingredients
Demo
Future improvements

Who am I?

Max Maischein
Frankfurt.pm
Perl since 2000
Financial regulatory regimes since 2013 (EMIR, EinSiG, MiFID II, ...)
Smart Data + data mining since 2016

Too much information

Too many different kinds of information
Too little time to organize it
Different from Google
Keep data local
Don't become part of a result set

Existing approaches

Google Search Appliance (too expensive)
Windows Desktop Search/Cortana (only Windows shares, no mail, etc.)
Siri + Sherlock (Mac only, and I have no Mac)
Beagle for Linux or Ubuntu (discontinued in 2009)

Do it yourself

Otherwise I would have no talk to give
Little time
Reuse many available building blocks

Splitting the task

Scraper / Crawler

Find documents
Find linked documents
Extract text
Import text
Metadata: Text language / URL / Creation time stamp

Search Index

Optimized data structure
Quick retrieval
Stemming (Find "Programs" and "Programming" when searching for "Program")
Synonyms
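
The index ideas above can be illustrated with a toy inverted index in plain Perl. This is only a sketch: Elasticsearch handles stemming through its own analyzers, and the `stem`, `add_document` and `search` helpers, the crude suffix rules, and the sample file URLs are all assumptions made up for this example.

```perl
use strict;
use warnings;

# term => { document id => number of occurrences }
my %index;

# Very crude stemmer (an assumption for this sketch, not a real
# Porter stemmer): strips "-ing" or a plural "-s".
sub stem {
    my ($word) = @_;
    my $w = lc $word;
    if ($w =~ s/(?<=\w{3})ing$//) {
        # "programm" -> "program": collapse a trailing double consonant
        $w =~ s/([^aeioulsz])\1$/$1/;
    } else {
        $w =~ s/(?<=\w{2}[^s])s$//;
    }
    return $w;
}

# Record every stemmed word of a document in the inverted index.
sub add_document {
    my ($id, $text) = @_;
    $index{ stem($_) }{$id}++ for $text =~ /(\w+)/g;
}

# Look up a single-word query; most frequent matches come first.
sub search {
    my ($query) = @_;
    my $hits = $index{ stem($query) } || {};
    return sort { $hits->{$b} <=> $hits->{$a} || $a cmp $b } keys %$hits;
}

add_document('file://a.txt', 'Programs are fun');
add_document('file://b.txt', 'Programming in Perl');
add_document('file://c.txt', 'Something unrelated');

print join(', ', search('Program')), "\n";
```

Because the lookup is a single hash access per term, retrieval stays quick no matter how many documents were indexed. Synonyms could be handled the same way, by mapping several words to one index key.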

Search

Query entry
Quick (!) response
Ranking
Preview of document

Parts

Crawler / Extractor (Perl+Apache Tika)
Index (Elasticsearch, Search::Elasticsearch)
Search (Dancer)

Live Demo

 cpanm --look Dancer::SearchApp
 plackup -Ilib -p 5001 --host 127.0.0.1 -a bin/app.pl &

 perl -Ilib -w bin/index-filesystem.pl t/documents

 # Search (browse to http://127.0.0.1:5001/)

ES Schema

URL / id (file:// or mail://)
title
body (HTML)
author
type (file or mail)
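
A document following this schema might look like the sketch below. The field names are taken from the bullet points above and are assumptions; the actual mapping used by Dancer::SearchApp may name them differently. The `make_document` helper and the sample values are hypothetical.

```perl
use strict;
use warnings;
use JSON::PP;

# Build one index document; the type is derived from the URL scheme.
sub make_document {
    my (%args) = @_;
    return {
        url    => $args{url},
        title  => $args{title},
        body   => $args{body},      # extracted content as HTML
        author => $args{author},
        type   => ($args{url} =~ m{^mail://} ? 'mail' : 'file'),
    };
}

my $doc = make_document(
    url    => 'file:///home/user/report.pdf',
    title  => 'Quarterly report',
    body   => '<p>Extracted text</p>',
    author => 'Example Author',
);

# Show the JSON that would be sent to the index
print JSON::PP->new->canonical->pretty->encode($doc);
```

With Search::Elasticsearch, such a hash would typically be passed as the `body` of the client's `index` call, with the URL serving as the document id.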

Crawlers / Extractors

File system (PDF, text, audio; via Apache::Tika::Async)
IMAP
ICal
HTTP (also, Plack)

Comparison with Google

Pagerank vs. Elasticsearch rank
Pagerank recognizes "Hub" pages
Every document on MY laptop is "interesting"

Search Results

We can display local content in local formats

PDF (as HTML)
Mail (link to Thunderbird)
Music (direct link)

Future improvements

Extraction from online content (Intranet, HTML::ContentExtractor::FTR)
More extractors (video, ...)
Metasearch across Elasticsearch instances (Laptop in home network)

Installation

Apache Tika

https://tika.apache.org/download.html

 http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.13.jar

Elasticsearch

https://www.elastic.co/downloads/elasticsearch

 https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/2.2.0/elasticsearch-2.2.0.zip

Thanks

Questions?

Dancer::SearchApp

corion@cpan.org

Credits

 "Hitman" by Kevin MacLeod (incompetech.com)
 Licensed under Creative Commons: By Attribution 3.0 License
 http://creativecommons.org/licenses/by/3.0/

Google Search Appliance image by Google Inc.

Cortana image by Microsoft Inc.

Apple Siri logo by Apple Inc.

Beagle logo by Fornax / Beagle Project

 https://de.wikipedia.org/wiki/Datei:Beagle_Logo.svg

Bonus section

Content / Testdata

My own mails
Trip back to 2000
No good for public consumption
EU/ESMA produces many PDFs
I produce many Perl programs
YAPC / Act produces many calendars

Crawlers

The heart of the search engine
Content extraction
Much existing code

Development

Filesystem Crawler

File::Find
Apache::Tika::Async for text extraction
Special extraction for mp3 and images
Done
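
The file system walk can be sketched with File::Find alone. This is an illustration under assumptions, not the code of index-filesystem.pl: the `crawl` helper and its callback interface are hypothetical, and the text extraction step (Apache::Tika::Async in the real crawler) is left out so that the example runs with core modules only.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;

# Walk a directory tree and hand every plain file to a callback,
# together with the metadata the index needs (URL, time stamp).
sub crawl {
    my ($root, $callback) = @_;
    find({
        no_chdir => 1,    # keep paths valid relative to the start directory
        wanted   => sub {
            return unless -f $File::Find::name;
            my $path = File::Spec->rel2abs($File::Find::name);
            $callback->({
                url   => "file://$path",
                mtime => (stat($path))[9],
            });
        },
    }, $root);
}

# Usage: list what would be indexed below the directories given as arguments
for my $dir (@ARGV) {
    crawl($dir, sub { print "$_[0]{url}\n" });
}
```

In a full crawler the callback would extract the text of each file and send the resulting document to the index.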

IMAP / Mail crawler

Not good for presentation
Trip back to 2001
Start with file import
index-imap.pl

25 August 2016
